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<^> Abstract 



In several application domains, high-dimensional observations are collected and then analysed in 
search for naturally occurring data clusters which might provide further insights about the nature of 
l/~) ■ the problem. In this paper we describe a new approach for partitioning such high-dimensional data. 

Our assumption is that, within each cluster, the data can be approximated well by a linear subspace 
estimated by means of a principal component analysis (PC A). The proposed algorithm, Predictive 
Subspace Clustering (PSC) partitions the data into clusters while simultaneously estimating cluster- 
wise PCA parameters. The algorithm minimises an objective function that depends upon a new 
measure of influence for PCA models. A penalised version of the algorithm is also described for carrying 

our simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in 

> 

jy-v , detail, and extensive simulation results and comparisons to competing methods are presented. The 



comparative performance of PSC has been assessed on six real gene expression data sets for which PSC 
often provides state-of-art results. 

1 Introduction 



In recent years, vast amounts of digital information has been generated which continues to grow ever 
more rapidly. For instance, within the realm of genomic research, high-throughput microarray technologies 
provide quantitative measurements of tens of thousands of genes for each biological entity under observation, 
such as a sample tissue; these measurements are then analysed to identify gene expression patterns that 
are predictive of a clinical outcome or to detect sub-populations. A number of other technologies, such 
as those developed for digital imaging, surveillance, and e-commerce, amongst many others, all generate 
observations which can be seen as random vectors living in a large dimensional space. 

Very often, even though the available observations are high dimensional, it has been found that their 
intrinsic dimensionality is in fact much smaller. For instance, although the number of pixels or voxels in a 
digital image can be extremely large, it is often the case that only a few important dimensions are able to 
capture some salient aspects of the available images, and those are sufficient to detect meaningful patterns. 
In genomics, despite the fact that the expression levels of tens of thousands of genes are being measured, a 
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much smaller number of dimensions may suffice to identify different underlying biological processes which 
characterise the available samples. 

The abundance of such high-dimensional data and the fact that their low-dimensional representations 
are often intcrprctable and informative has caused projection-based dimensionality reduction techniques to 
become popular. When each data point is treated as an independent realisation from a given distribution 
with support in a low-dimensional subspace, Principal Component Analysis (PCA) is commonly used 
to recover the low-dimensional representation of the data [12] . PCA and related techniques, however, 
assume that the available data points are drawn from a single underlying distribution that is typical of 
the population from which the observations have been obtained. Whereas this can be a valid assumption 
in some cases, there are also many applications in which the underlying population is suspected to be 
highly heterogeneous; in those cases, each observation may have been drawn from one of many alternative 
distributions, and a single low-dimensional representation of the data would not be sufficient. 

Partitioning the sample data points according to the low-dimensional subspace that best describes them 
has become an important research question in various domains. In cancer genomics, for instance, the bio- 
logical tissues for which high-dimensional gene expression profiles have been observed may be representative 
of a number of cancer sub-types; the identification of these sub-populations is important because each one 
of them may be associated with a distinct clinical outcome, such as life expectancy after therapy [34] . In 
the analysis of digital images of human faces, each subspace may be associated with a particular person 
for which a number of images have been collected, for instance under different lighting conditions, and the 
identification of these subspaces will aid face recognition [16 . When the observations are deemed to be 
representative of K distinct low-dimensional subspaces, the problem of subspace clustering then consists in 
the simultaneous estimation of these K subspaces and the unobserved membership of data points to sub- 
spaces [29] . A formal definition of this problem as well as a brief survey of existing models and algorithms 
in provided in Section [5] 

In this work we introduce a novel approach to subspace clustering and present extensive comparisons 
to competing methods using both simulated and real high-dimensional genomic data sets. We exploit the 
fact that PCA provides the low-dimensional representation of the data that minimises the reconstruction 
error, and propose a criterion derived from the out-of-sample error as the building block for a subspace 
clustering algorithm. An overview of the contributions presented here is in order. 

First, for a single PCA model, we propose a computationally efficient approach to detect influential 
observations, namely data points exerting a larger effect on the estimated PCA parameters compared to 
other points. Our motivation for detecting influential observations is that, under the assumption that K 
alternative low-dimensional representations of a data point might exist, a measure of influence will provide a 



useful metric for assigning data points to subspaces. A common way to quantify the influence of a data point 
consists of examining the changes in the model parameters when the parameters have been re-estimated 
after removal of that observation [2]. In the context of ordinary least squares (OLS), this is equivalent to 
computing the leave-one-out (LOO) estimate of the regression coefficients with respect to each observation. 
The LOO prediction error is then evaluated using the remaining observations, and influential observations 
are identified as those with a large LOO prediction error relative to other observations. Following up 
previous work done in the context of partial least squares regression |15) . here we propose a closed- form 
expression to compute LOO errors under a PCA model, known at the PRESS statistic, which only requires 
a single PCA model to be fitted to the entire data set. Armed with this analytical expression for the 
PCA PRESS statistic, we propose a notion of predictive influence that an observation exerts on the PCA 
model, and elaborate on our previous work [16J ; intuitively, data points that are highly influential under 
a particular PCA model hay have been generated by a different model. This methodology is presented in 
Section [3l 

Building on this notion of predictive influence, we develop a clustering algorithm that works by inferring 
the distinct low-dimensional spaces that are representative of each cluster. The optimality criterion we 
propose for driving the data partitioning process is such that total within-cluster predictive influence is 
minimised. The resulting algorithm, called Predictive Subspace Clustering (PSC) because of its direct 
dependence upon out-of-sample prediction errors, is an iterative one and is guaranteed to convergence to 
a local minimum. The convergence of the proposed algorithm has been studied, and we provide a detailed 
account. Although inferring the unknown number of clusters is a notoriously difficult problem, model 
selection can be somewhat naturally incorporated into the proposed subspace clustering framework by 
making use of the PCA PRESS statistic. Furthermore, motivated by genomic applications in which the 
detection of a small number of informative variables is important, we also discuss a variation of the PSC 
algorithm which provides sparse low-dimensional representations of the data in each cluster. Forcing sparse 
solutions within our clustering algorithm is accomplished by taking a penalised approach to PCA. These 
developments are presented in Section 0J 

In Section [5] we present extensive simulation results based on artificial data and discuss a number 
of situations where the proposed approach is expected to provide superior clustering results, compared to 
standard clustering methods as well as other subspace algorithms that have been proposed in the literature. 
Different scenarios are considered in which both the number of subspaces and their dimensions is allowed to 
vary, and where some dimensions may be totally uninformative for clustering purposes and only contribute 
to noise. The difficult model selection problem, that is the problem of learning both the number of clusters 
and the dimensionality of each subspace, is also discussed. 



In Section [5] the performance of the proposed PSC algorithm is compared to other subspace clustering 
algorithms using six high-dimensional genomic data sets in which several thousands of gene expression 
measurements have been made on various biological samples. For these data sets, which are all publicly 
available, both the number of clusters and the cluster memberships can be considered known, and this in- 
formation can be used to test and compare the performance of competing clustering methods. It has often 
been observed that only certain variables (i.e. genes) are important for determining the separation between 
clusters. It is therefore desirable to be able to identify and remove any uninformative variables. Our exper- 
imental results demonstrate that the assumptions underlying the proposed PSC algorithm seem plausible 
for real gene expression data sets on which we have obtained state-of-the-art clustering performance. 

2 Subspace clustering: problem definition and existing methods 

We assume to have observed N data points, {a^}^, where each Xi £ K lxP and the dimension P is usually 
very large. Each point is assumed to belong to one of K non-overlapping clusters, {Ck}^ ■ We further 
assume that the points in the k th cluster lie in a Rk— dimensional subspace, Sk where Rk << P. As in 
[29] . we assume that each subspace Sk is defined in the following way 

(1) S k = {xi : Xi = ii k + u k jV k T } 

with i € Ck and k = 1, . . . , K, where Vk <E M. PxRk is a basis for Sk whose columns are restricted to be 
mutually orthonormal. The point Uk,i & M. Rk is the low dimensional representation of Xi and fj,k G M. p is 
an arbitrary point in Sk-, typically chosen to be 0. 

When only one cluster exists (i.e. K = 1), a subspace of this form can be estimated by fitting a single 
global PCA model. Alternatively, where K > 1 and the assignment of points to clusters is known, each 
one of the K subspaces of form ((TJ) can be estimated by fitting a PCA model independently in each cluster. 
However, the cluster assignments are typically unknown, and the problem consists in the simultaneous 
partitioning the data into clusters and estimating cluster-specific subspaces. There are several fundamental 
difficulties associated with this problem: (a) identifying the true subspaces is dependent on recovering the 
true clusters and vice-versa; (b) subspaces can intersect at several locations which causes difficulties when 
attempting to assign points to subspaces at these intersections, and standard clustering techniques such as 
K-means may not be suitable; (c) the subspace parameters and the cluster assignments are dependent on 
both number of clusters and the dimensionality of their respective subspaces, which pose difficult estimation 
challenges. 

Furthermore, in problems where P is large, some of the variables may be uninformative for clustering. 
In some applications, there might also be an interest in selecting a specific subset of dimensions that are 



highly discriminative between clusters, and can be more easily interpreted. In such situations, we may 
be interested in carrying out sparse subspace clustering by adding some form of regularisation during the 
estimation of the eigenvectors in V, such that each eigenvector contains a predetermined number of zero 
coefficients. To the best of our knowledge, the problem of variable selection in subspace clustering has not 
been widely studied until now. 

A variety of approaches have been proposed to solve the subspace clustering problem. Several methods 
are based on generalising the widely used K-me&ns algorithm to -ftT-subspaces [4j [30] . These methods iter- 
atively fit PCA models to the data and assign points to clusters until the PCA reconstruction error in each 
cluster is minimised. Although the approach based on minimising the within-cluster PCA reconstruction 
error is simple and has shown promising results, it is also prone to over-fitting. For instance, the data 
may be corrupted by noise or lie on the intersection between subspaces and so points within clusters may 
be geometrically far apart; we illustrate this problem using artificial data in Section [SJ Furthermore, each 
subspace may have a different intrinsic dimensionality. The PCA reconstruction error decreases monoton- 
ically as the dimensionality increases, so points may be wrongly assigned to the cluster with the largest 
dimensionality. Such an approach therefore limits the number of dimensions to be the same in each clus- 
ter. Other approaches to subspace clustering have been taken based on mixtures of probabilistic PCA 
|28) . The recently proposed mixtures of common t- factor analysers (MCtFA) attempts to overcome the 
problems posed by over-fitting and potential outliers by assuming that the clusters share common latent 
factors which, instead of being normally distributed, follow a multivariate i-distribution pQ. 

A different class of subspace clustering algorithms involves the computation of a measure of distance 
between each pair of points which captures the notion that points may lie on different subspaces. The dis- 
tances are then used to construct an affinity matrix which is partitioned using standard spectral clustering 
techniques |13j . There have been several successful approaches to defining such a distance measure. The 
method of Generalized PCA (GPCA) fits K polynomials of varying order to the data and measures the 
distances between the gradient of the polynomials computed at each point |14j . Sparse subspace clustering 
(SSC) obtains a local representation of the subspace at each point as a sparse weighted sum of all other 
points [9]; this is obtained by minimising the reconstruction error subject to a constraint on the £± norm 
of the weights so that the few non-zero weights correspond to points lying on the same subspace. Spectral 
curvature clustering (SCC) constructs a multi-way distance between randomly sampled groups of points 
by measuring the volume of the of the simplex formed by each group [7]. Points which lie on the same 
subspace will define a simplex of volume zero. Spectral local best flats (SLBF) estimates a local subspace 
for each point by fitting a PCA model to its nearest neighbours; it then computes pair-wise distances 
between the locally estimated subspaces corresponding to each point [35] . Although spectral methods have 



achieved state-of-the art results in some application domains such as clustering digital images, they come 
with their own limitations. Computing local subspaces for each point can be computationally intensive 
and requires additional tuneable parameters. 

3 Detecting influential observations in PCA 

3.1 Influential observations in ordinary least squares 

An influential observation is defined as one which, either individually or together with several other observa- 
tions, has a demonstrably larger impact on the model estimates than is the case for most other observations 
[2J. Unlike outliers, influential observations are not necessarily apparent through simple visualisations of 
the data, and therefore a number of approaches have been proposed to detect them [21 . Several popu- 
lar methods rely on the computation of the leave-one-out (LOO) residual error [2J[TB]. For instance, in 
the context of ordinary least squares (OLS), a common strategy consists in quantifying the effects that 
removing a single observation and re-fitting the model using the remaining N — 1 observations has on the 
estimated regression coefficients [6j. 

When N data points {xi,yi}^ are available, where each Xi € R p is a covariate and yi £ M. is the 
associated univariate response, the LOO error for the i observation is defined as e_i =y t — Xi(3^i where 
(3-i G M Pxl has been estimated using all but the i th observation. The sum of LOO errors is then as 
obtained as 

N 



( 2 ) J ols4E 






A naive computation of Jols requires fitting N OLS models, each one using N — 1 observations. For 
each model fit, the inverse of sample the covariance matrix, P ', has to be estimated. In order to contain 
the number of computations needed to evaluate JolSj each LOO error can instead be found in closed- form 
after fitting a single regression model with all the N data points. This is particularly beneficial when 
either N or P is large. Since each (3-i differs from (3 by only one pair of observations, {xj, yt}, a recursive 
formulation of /3_j that only depends on (3 is obtained by applying the Sherman-Morrison theorem |25j . 
as follows 

(3) /3-i = (X T X - xJ Xi y l (X^y xjy t ) =(3- ^- x f Px ^ 

1 X^ ±Xi 

In this formulation, each /3-i is readily available as a function of (3, without the need to explicitly remove 
any observations, and without having to re-compute N inverse covariance matrices. When this approach 
is taken, Eq. ^ is known as the Predicted REsidual Sum of Squares (PRESS) statistic [2]. 



In the following Section we propose an analogous PRESS statistic for PCA and describe a methodology 
for detecting influential observations under a fitted PCA model. This approach will then be used to define 
an objective function for subspace clustering in Section [4] 

3.2 An analytic PCA PRESS statistic 

PCA is a ubiquitous method for dimensionality reduction when dealing with high-dimensional data. It 
relies on the assumption that the high-dimensional observations can be well approximated by a much 
lower-dimensional linear subspace which is found by estimating the best low-rank linear approximation of 
the original data |12j . One way of achieving this is by estimating a set of R mutually orthonormal vectors 
[in 1 ), . . . , v*- R '] which minimize the £2 reconstruction error, defined as 

N R 



(4) -J^Wxi-x^v^v 



(r) 

N 



i—l r—1 

On defining a matrix X G M. NxP with rows a;.;, which we assume to be mean-centred, the vectors 
which minimise (j4]) are obtained by computing the singular value decomposition (SVD) of X, given by 
X = UAV T . Here, U = [u (1 \ ..., u^'] G R NxN and V = [t> (1) , ..., v (p) ] G M PxP are orthonormal matrices 
whose columns are the left and right singular vectors of X, respectively, and A = diag(A^\ . . . , A^') G 
M JVxP is a diagonal matrix whose entries are the singular values of X in descending order. 

It can also be noted that, when taking the first principal component, the residual error can be written 
as a quadratic function of v^\ 

1 N T 

(5) l^H^-dfV 1 ) || 2 , 

where cQ = XiV^\ This formulation suggests that the eigenvector v^ can be rewritten in the form of a 
least squares estimator, 

(6) v^ = (d^ T svy 1 (x T d« 

where we have defined the vector d 1 - 1 ' = [d\ , . . . ,d N ] T ; analogous expressions exist for the remaining 
eigenvectors, v( 2 \ . . . , v^ R \ Clearly, each d^ depends on v^ r \ and these eigenvectors are obtained as 
usual, using the SVD. This least squares interpretation will provide a starting point for deriving an efficient 
PRESS statistic for PCA. 

In the context of PCA, the LOO error has often been used to drive model selection as well as detect 
influential observations |19l 15]. When R principal components are considered, this reconstruction error is 
given by 



(7) e>_> = Xi - x t . 2^ vl(v 



R 

( R ) _ ^ ~ v^ „.<>■)„» 

i ' 
r=l 



where each v_j is the r th right singular vector of the SVD estimated using all but the i th observation of 
X , and the sum of all LOO reconstruction errors is 

The usual approach for the evaluation of JpcA re Q u i res N SVD computations to be performed. Clearly, 
this approach is computationally expensive when either N or P is large, especially if ((5J) has to be evaluated 
a number of times. 

We propose a novel closed-form approximation of (J8J which is computationally cheap and such that the 
approximation error is negligible for any practical purposes. The only assumption we make is that, for a 
sufficiently large sample size N, the error made by estimating the R eigenvectors v^>, v^ 2 \ . . . , v^ using 
the SVD of the reduced data matrix with (V — 1) rows, that is after removal of the i observation, is 
sufficiently small, compared to the analogous estimation using the full data matrix. Under this assumption, 
we have that v_- ca v^ r ' and therefore XiV_\ ~ XiV^ r ' for i = 1,...,N. We note here that a similar 
assumption has been made in other applications involving high-dimensional streaming data, where it is 
called Projection Approximation Subspace Tracking (PAST) |33j . 

With only one principal component, this approximated PCA PRESS statistic can be written as 

1=1 

where d\ — a;,?/ 1 ), as before. Since v^ can be expressed as the solution to a least squares fit, as in ([5]), 
we can obtain v^i in closed-form through least squares estimation. Using the relationships 

N N 

£#> = d^ T d^ _ d ^\ j2 Xjd f ) = xd^ _ Xld f) 

we arrive at an expression for v_i by applying a single-observation, down-dating operation, as follows 
(10) v™ = (dW T dW - d^ 2 ) ~ % (x T d^ - x]df ] 



The advantage of this reformulation is that Eq. (|10p can be evaluated using the same recursive form 
given by Eq. ([3]). Using the same recursive approach, we then obtain an expression for v_ t in terms of the 
original eigenvector v 1 - 1 ' , 

(xJ-d^v^D^d^ 
(11) ^^CD-I. \) -, 

1 - h l 

where h\ = d\ D^d\ , D^ — ld^ d^ 1 ') , and i/ 1 ' has been obtained by computing the SVD of 
the complete data matrix X in the usual way. Using this recursive expression, the i th LOO reconstruction 



error is obtained in closed-form 



(12) e«=»«-d« „CD_W • ) 



1-h™ 



By substitution, using the fact that the the i th reconstruction error is ef = Xi — XiV^v^ , we then 
obtain 



(13) e -= e < f 1 '!-^);^^ 

which can be computed in closed form, without the need for explicit LOO steps. This construction can be 
easily extended to the case of R > 1 principal components. When R = 2, we have 






,( 2 ) _ „. _ W(l)„(l) ' _ W(2)»,( 2 ) ' - _fi" ,v( 2 U 2 ) 



1-^ W 



Adding and subtracting x^ from both sides we obtain 



(!) -r 2 M 
e ( 2 ) - _£l + x _ rf (2) v (2) T _ _ V- _£< 



where e 2 = a;, — XiV^v^) . Therefore, with R principal components, the approximated PCA PRESS 
statistic is 



, N R (r) 

( 14 ) ^a^^EhEt^tm-^- 1 )^ 

i— 1 r- 



This expression only depends on quantities estimated using a single PCA model fit, and can be computed 
cheaply. The approximated JpcA ^ s f° un d to be close to the true PCA PRESS JpcA under the assumption 
that the error made by estimating the R principal components by N — 1 instead of N observations is 
small. In related work, we have previously shown that the approximation error depends on the number of 
observations as 0(-\/log N/N) [15]. This result can also be checked by simulation. Figure Q] shows the mean 
squared approximation error as a function of N using data simulated under a PCA model with P = 500 
variables. It can be seen that the decrease in the approximation error follows the theoretical error (plotted 
as a dashed line) and the difference in computational time increases super-linearly. 

3.3 A measure of predictive influence in PCA 

In this section we make use of the closed- form PCA PRESS statistic of Eq. (fl4|) . and propose an analytic 
influence measure for the detection of influential observations under a fitted PCA model. Specifically, we 
introduce the notion of predictive influence of an observation Xi under a given PCA model which quantifies 
how much influence xt exerts on the LOO prediction error of the model. By making explicit use of the out- 
of-sample reconstruction error, this influence measure is particularly robust against over fitting compared 
to the in-sample reconstruction error that is minimised by PCA, as in Eq. @. 
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Figure 1: The top plot shows the mean squared approximation error between the leave-one-out cross 
validation Jpca and our analytic PRESS statistic Jpca as a function of the number of samples, N. We 
report on averages over 100 Monte Carlo simulations. It can be seen that the empirical approximation 
error scales according to the theoretical error, which is shown as a dashed line. The bottom plot shows 
that the difference in computational time between Jpca and the Jpca increases super-linearly with N. 



11 



As before, we assume that a PCA model has been fit to the data, and V = [vd', ..., v^} £ R PxP is 
the orthonormal matrix whose columns are the right singular vectors of X. Analytically, the predictive 
influence of observation Xi, denoted here Tt(xi; V) is defined as the gradient of the PCA PRESS given in 
Eq. (HU) with respect to observation Xi, that is 

dJ [R) 



(15) ir(xi-,V) 



'PCA 



dxi 

Using the results from the previous section, an analytical expression for the gradient in Eq. (|15p can 
be derived, and is found to be 



(l p - t>( r V r ) 



(16) <*H V) = e^ Y. ^7 7^- -t R -^ 

V=i (i - K \ 

The full details of this derivation are given in Appendix |XJ It can be seen that the predictive influence of 
a point, Xi under a PCA model has a closed form which is related to the leave-one-out error of that point, 
e_j. 

It was observed in [5] that single deletion methods for identifying influence, such as the PRESS, are 
only sensitive to large errors. Instead they propose observing the change in the error as each observation is 
weighted between (equivalent to LOOCV) and 1 (equivalent to the standard residual) and then computing 
the derivative of the residual with respect to the weight. On the other hand, our proposed predictive 
influence measures the sensitivity of the prediction error in response to an incremental change in Xi. The 
rate of change of the PCA PRESS at this point is given by the magnitude of the predictive influence 
vector, ||7r(:Ej; V)|| 2 . If the magnitude of the predictive influence is large, this implies a small change 
in the observation will result in a large change in the prediction error relative to other points. In this 
case, removing such a point from the model would cause a large improvement in the prediction error. We 
can then identify the most influential observations as those for which the increase in the PRESS is larger 
relative to other observations. Since we take the gradient of the PRESS with respect to the observations, 
we arrive at a quantity which is more sensitive to individual influential observations than the original 
PRESS function. In this work, the predictive influence measure has been used in the context of subspace 
clustering, which is introduced next. 

4 Predictive subspace clustering 

4.1 The PSC algorithm 

The proposed algorithm for subspace clustering relies on the following observation. If the cluster assign- 
ments {Cfe}f were known and a separate PCA model was fit to the data in each cluster, then the predictive 
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influence of a point Xi belonging to cluster Ck would be smaller when evaluated using the correct PCA 
model for that cluster, than using any of the remaining K — 1 PCA models. In this respect, the predictive 
influence provides a metric that can be used to drive the clustering process. Since our clustering algo- 
rithm is based on a measure of predictive ability of the estimated subspaces, we call it Predictive Subspace 
Clustering (PSC). 

The objective of the clustering algorithm is to partition the N observations into one of K non- 
overlapping clusters such that each cluster contains exactly Nk observations and ^2k=i ^k — N and where 
the points in each cluster lie on a subspace of the form described in Eq. ([1}. In our proposal, this will 
be accomplished by searching for the cluster allocations and PCA model parameters that minimise the 
following objective function, 

K 

(17) c(Vi, . . . , v fc ,Ci, . . . ,c K ) = J2 J2 11**0*; yfe )H 2 ' 

fc=l i£C k 

where iTk{xi\ 14) is the predictive influence of a point Xi under the k fh PCA model. Since the cluster 
allocation and PCA model parameters depend on each other, there is no analytic solution to this problem 
and we must resort to an iterative procedure. 

This problem can be approached by considering the two related optimisation tasks. At a generic 
iteration r, given the K subspaces with parameters Vi,...,Vk which were estimated in the previous 
iteration r — 1, we search for the cluster assignments minimising (|17p , 

(18) min C(V 1 (T - 1 \...,VJ i T - 1 \c 1 ,...,C K ). 

Using the predictive influences ttjJ {xi\ VL ) , which were computed at the end of the previous iteration 
for all i = 1, . . . , n and all k — 1, . . . , K , a solution for (fT5|) is found by assigning each point Xi to the 
cluster for which the magnitude predictive influence is smallest, that is 

(19) Ci r) ^ ^i:mm\\^- 1 \^V^ 1) )\\ 2 

Then, in the second step, given the new cluster assignments {CfJ }± , we re-estimate the parameters of 
the K subspaces minimising (|17[) by solving 

(20) min C(Vi,...,V K ,c[ T \...,cP). 

vi,...,Vk 



Using the current cluster assignments, a solution for Eq. (|20|) is found by re-fitting all K PCA models 
because, from Eq. (|16[) . minimising the PCA construction error is also seen to minimise the predictive 
influences. Once all the parameters {Vl ; }f" have been re-estimated, the predictive influence measures 
n k ( x i'i V) f° r all fe = 1, 2, . . . , if and i = 1, 2, . . . , N, are updated for the subsequent iteration. 
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The algorithm is initialised by generating an initial random partitioning, {C k }i , which is used to 
estimate the K initial PCA models, find the initial parameters {V fe }f , and compute the corresponding 
predictive influences, 7rj_ (a;,; V) for k = 1, . . . , K and i = 1, . . . ,N. At each iteration r = 1,2,..., the 
algorithm alternates between the two steps described above until convergence is reached and the objective 
function can no further be reduced. 



4.2 Convergence of the PSC algorithm 

In this section we demonstrate that, for any initial configuration, the PSC algorithm is guaranteed to 
converge to a local minimum of the objective function (|17[) . In order to keep the notation simple, and 
without any loss of generality, we only discuss the case of the first principal component. Furthermore, we 
denote S^ — {«£ , • ■ • > % } the set of all PCA parameters at a generic iteration of the algorithm, r; we 
also use C^ T ' = {C{ T , . . . ,C^ } to denote the set of clustering assignments at the same iteration. With this 
notation in place, in order to show the algorithm converges we must demonstrate that, at each iteration r, 
the following inequalities are satisfied: 

(a) C{S {T - l \,C^) < C(S ( r- 1 \C<- r - 1 ')), after the first step, dHJ); 

(b) C{S {t \C {t ^) < C(S( T - 1 >,C( T >), after the second step, l[20jl. 

Checking that the first inequality holds after the first step of the algorithm is straightforward because, 
by definition, the cluster assignments C^ T ' obtained by the assignment rule (|19p directly minimise the 
objective function when the PCA parameters are held fixed. The second inequality, however, requires a 
more elaborated argument. 

Assuming that the clustering memberships are given and held fixed, we first note that solving (|20[) is 
equivalent to solving the following problem 

JV 

(21) minV ||7r(a; J ;t;)|! 2 , 

i=\ 

subject to ||r>|| = 1, 

separately for each cluster. The following lemma provides an alternative formulation of this optimisation 
problem. 

Lemma 1. Solving the minimisation problem (|2ip is equivalent to solving the following maximisation 
problem 

(22) m&xv T X T E- 2 Xv, 

subject to \\v\\ = 1. 
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where S € K is a diagonal matrix with diagonal elements Sj = (1 — /i^) 2 . 

The proof of this lemma is provided in Appendix [B] The maximisation in (|22p can be recognised as 
an eigenproblem where each observation has been weighted by a function of its leverage under the PCA 
model. We denote the optimal parameters that provide a solution to this problem as S* = {vl}^ . If such 
solution was available, it would satisfy inequality (b) above, so that 

C(S*,C (r) ) <c(s< r - 1 \cM). 

The solution S* depends on the diagonal element of S, which in turn depend on the PCA parameters 
and are non- linear function of v, and therefore a closed- form solution to (|22|) cannot be found simply by 
eigendecomposition of X T a~ 2 X. However, using this formulation, we are able to prove that the PCA 
parameters at iteration r are closer to the optimal solution S* than the parameters from the previous 
iteration, i.e. S^ 1 " -1 ). 

Lemma 2. For a single cluster k, we define the error between the the optimal parameters v^ obtained by 
solving (|22[) directly and the PCA parameters from the previous iteration, vjj , as 

Analogously, the error between the optimal parameters and the PCA parameters obtained at the current 
iteration r is defined as 

These two error terms satisfy the inequality 

(23) E{S*, S {T) ) < E{S*,S( T -V). 



The proof of this lemma is provided in Appendix [CJ We are now ready to formulate the main result 
stating the convergence of the PSC algorithm. 

Theorem 1. Starting with any cluster configuration, {Cj^ j\> a ^ each iteration of the PSC algorithm 
the objectives function (|17[) is never decreased, and the algorithm converges to a local minimum. 

Proof. The proof of this theorem simply follows from the observations made above, and the two lemmas, 
which show that the inequalities (a) and (b) are satisfied at each iteration. 

□ 
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4.3 Sparse subspace clustering 

As mentioned in Section [JJ there may be many unimportant variables that only contribute to noise and 
therefore could be discarded. Moreover, in some applications, there might be the need to select only 
a handful of informative variables that best represent the subspace in each cluster. Since the principal 
components involve linear combinations of all P variables in X, we present a variation of the PSC algorithm 
that build on a sparse PCA model estimation procedure. 

Following the strategy described by |24j . sparse loading vectors can be obtained by imposing an l\ 
penalty on the PCA objective function (j4|. When R = 1, this new optimisation problem become 

(24) mm\\X-uv T \\ 2 F + 1 \\v\\ 1 

subject to ||m|| = 1. 

where ||A|||, = Tr(A T A) denotes the squared Frobenius norm. This problem can be solved by first 
obtaining u = u^ 1 ' and v = a^v^ where u^\v^ and a^ are the first left and right singular vectors 
and corresponding singular value computed from the SVD of X. A sparse solution is obtained by applying 
the following iterative soft thresholding procedure to the elements of v. 

(25) v=sgn(X T u) (\X T u\ - 7 ) + 

Xv 



(26) u 



\Xv\ 



The updates ([23)) and ([23)1 are applied iteratively until the change in v between iterations falls below 
some threshold. Subsequent sparse loadings can be found by deflating the data matrix and repeating the 
above steps as in 21] ■ The complexity of this procedure when R sparse principal components are required 
is O(NPR), which keeps the computational burden relatively low. 

A penalised version of the PSC algorithms is obtained by modifying the second step of the algorithm 
so that (|20p is replaced accordingly; in the case of the first principal component, this amounts to solving 

(27) min C(v 1 ,...,v k ,C u ...,C K ) 

{i>i,...,ijk} 

subject to \\vk\\i < 7fe for k = 1, . . . ,K 

where the parameter which controls the level of sparsity, 7^, can in principle be different for each cluster. 
It can be seen that there are K inequality constraints, one for each of the subspaces. This sparse version 
of the PSC algorithm is detailed in Algorithm [TJ Clearly, when the sparsity parameters are taken to be 
zero, we obtain the the unpenalised version of the algorithm and no variable selection is performed. 



1G 



Input: X, K, R k , lk ,C k , fork = l,...,K 
Output: {d,...,C K }, {V U ...,V K } 
while not converged do 
for k = 1 , . . . , K do 

// Compute the sparse subspace parameters 

X k <- {xi} e C k ; 

[ w (i) ) CT (i) )V (i)] ^ sv d(X k ,R); 

for r = 1 , . . . , i?^ do 

while change in v^ r ' > tol do 

t/ r ) ^sgn(X T «( r )) (|X T w( r )| - 7 ) + ; 

(r) , Xv^ . 
U ^ \\Xv^\\ ' 

end 

X k ^X k - u WvW T 
end 

Compute PRESS, J k ; 

for i = 1 , . . . , N do 

Compute predictive influence, 7Tfc(xi; V); 

end 
end 

// Assign points to clusters 
for i = 1 , . . . , N do 



Cr^:™!!^- 1 ^;^)!! 2 }. 



*(r) 



end 

// Check for convergence 

it J2ti 4 T) -Jt 1} > tol then 

converged; 
end 

r <- r + 1: 



end 



Algorithm 1: The penalised PSC algorithm. 
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4.4 Model selection 

Model selection in subspace clustering consists of learning both the number of clusters and the dimension- 
ality of each subspace. Each one of these two problems is particularly challenging on its own. Since the 
PSC algorithm is a non-probabilistic one, traditional model selection techniques for learning the number 
of clusters such as the Bayesian Information Criterion and related methods do not apply. Moreover, it 
has already been noted that such methods, when can be used, do not always select optimal models that 
maximise clustering accuracy pQ. 

In the subspace clustering literature, the problem of model selection is still in its infancy. A method 
called second order difference (SOD) has recently been proposed to determine the correct number of clusters 
[55] . The SOD method can be seen as an extension of the gap statistic originally proposed by [37] , and 
works as follows. For each value of if up a maximum number of clusters, K max , the cluster assignments 
and corresponding subspaces as in ([T]) are estimated. For each value oi K, we compute the distance between 
points and subspaces within each cluster, as follows 

K 



wk = ^J2wT,\\^- x ^ v ^\ 



2 



K ^ N k 
fe=i zee,. 



The second derivative of the within-cluster residual error with respect to the number of clusters K is 
approximated by 

SOD(log W K ) = log W K -i + log W K+1 - 2 log W K , 

and this quantity is used as a search criterion. The optimal number of clusters, K* is then chosen to be 
the value that maximises this criterion, 

K* = maxSOD(logW K )- 

This optimal choice corresponds to the point beyond which increasing the number of clusters has less effect 
on the within-cluster residual error. 

Apart from the SOD method, our proposed PCA PRESS statistic (TUT) also naturally provides an 
alternative criterion for model selection. As with SOD, we initially assume that the dimensionality of each 
subspace is known, and is the same for all subspaces, so that Rk — R for all k. Using a varying number 
of clusters up to a pre-determined maximum, K max , we run the PSC algorithm and, for each K, record 
the corresponding value of the PCA PRESS statistic, that is Jp CA (K). The optimal number of cluster is 
found as 

K* = mmJ^ A (K). 

This approach is somewhat similar to SOD, however instead of measuring the within-cluster PCA residual 
error, which may suffer from over-fitting, we use the within cluster LOO cross-validation error, which 



18 



negates the requirement to compute the second derivative with respect to K . The PCA PRESS is robust 
against over fitting and has been found to work generally well in our experiments presented in the next 
section. 

The problem of learning the subspace dimensions {Rk}^ has not been well studied in the subspace 
clustering literature, and is very much an open question. However, the PRESS has often been used to 
perform model selection in standard PCA, and seems to be well suited for this problem. With a given fixed 
K, at each generic iteration r of the PSC algorithm, we evaluate the PCA PRESS using all values of R k T 
from one to a pre-determined maximum, R m ax, and then select the set {R k }\ of subspace dimensions 
that minimises the PCA PRESS at that iteration. Although computationally more expensive, this strategy 
has often been found successful in learning these parameters. Some experimental results are reported in 
the next section. 

Finally, an important feature of the PSC algorithm is its ability to estimate sparse PCA models. In 
order to obtain sparse models, we need to specify additional parameters {7fc}f which control the number 
of variables to be retained within each cluster. This introduces further complications to the issue of model 
selection. In the context of sparse predictive modelling, it has often been observed that prediction-based 
methods such as the PRESS do not perform well for selecting the optimal sparsity parameter. Recently, 
subset resampling methods such as stability selection have shown promising results for accurately selecting 
regularisation parameters [17] . However, implementing such data resampling schemes within the PSC 
algorithm would be computationally prohibitive. 

5 Simulation experiments 

We start by presenting a number of simple simulation studies in low dimensions to illustrate the type 
of clusters that can be detected by the PSC algorithm, and compare the performance of the proposed 
algorithm to existing methods. Clusters of 100 data points were generated and, within each cluster, the 
points were distributed uniformly on a one, two and three-dimensional linear subspace embedded in a 
three-dimensional space. To define each subspace, we generated a set of Rk orthonormal basis vectors, 
each of dimension P = 3, where each element was sampled from a standard normal distribution. For each 
cluster we then sampled 100 it^-dimensional points from a uniform distribution which were then projected 
onto its corresponding subspace. 

In particular, we present four simulation scenarios consisting of points which lie on: (a) two straight 
lines; (b) a straight line and a 2-D plane; (c) two 2-D planes; (d) a straight line, a 2-D plane and a 3-D 
sphere. A typical realisation of each scenarios is illustrated in Figure [2[ and for each case we show the 
original data points in P dimensions (left), the clustering assignments using K- means clustering in the 
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original three-dimensional space (centre), and the clustering assignment using PSC (right). It can be noted 
that the subspaces always intersect so points belonging to different clusters will lie close to each other at 
these intersections. The if-means algorithm, which uses the Euclidean distance between points and has 
been applied directly to the three-dimensional observations, consistently fails to recover the true clusters, 
as expected in these cases. On the other hand, PSC correctly recovers both the true clusters and the 
intrinsic dimensionality of the subspaces. Despite the simplicity of these initial illustrations, they highlight 
the benefits of subspace clustering. 

A Monte Carlo simulation study was carried out to compare the performance of five competing clus- 
tering methods: the proposed PSC algorithm, the if -means algorithm (as a benchmark), GPCA |14j . 
if -subspaces [3U], SCC [7] and Mixture of Common t- Factor Analysers (MCtFA) [T]. In our studies, in 
order to make the comparisons fair, we provide each of these methods with the true number of clusters 
and the true dimensionality of each subspace. In addition to the four simulation settings considered above, 
we also considered an additional scenario, (e), consisting of four distinct subspaces: a 5-D hyperplane, 4-D 
hypersphere and two lines embedded in P = 200 dimensions. Since the dimensionality of this scenario is 
large, in order to use the GPCA algorithm we must first perform PCA to reduce the dimensionality to 
P = l. 

Results for this comparison are given in Table Q] which reports on the mean clustering accuracy over 
100 Monte Carlo simulations. It can be seen that PSC achieves a consistently high clustering accuracy 
in all scenarios with a small standard error (reported in parenthesis). SCC also performs extremely well 
for scenarios (a)-(c) but does more poorly for (d) and (e). The adjusted rand index (ARI) for the same 
experiments is reported in Table O In this case, the relative difference in performance between PSC and 
two other subspace clustering algorithms, GPCA and if -subspaces, increases significantly; PCS compares 
favourably in all scenarios. As before, SCC also shows performance comparable to PSC for (a)-(c), but 
does less well in (d)-(e). As expected, if -means always performs poorly. 

Finally, we explored the performance of these algorithms on a more challenging set of artificial data. 
Firstly, compared to the previous examples, we increased the dimensionality of all datasets to P = 200. 
Secondly, in order to construct the low-dimensional subspaces, we generated sparse loading vectors where 
only 10 randomly chosen variables, out of 200, had non-zero coefficients, so that the remaining 190 variables 
do not contribute to clustering. Gaussian noise with zero-mean and variance 0.5 was added. Each of the 
Rk loading vectors in each cluster had the same number of non-zero elements, but the sparsity pattern 
in each vector was different. This data generating mechanism allows us to test the ability to estimate 
subspace parameters and simultaneously recover cluster assignments when there is a large number of noisy, 
irrelevant variables. Both the true level of sparsity and the true number of clusters are assumed known. 
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Figure 2: Four illustrations of data clusters existing in subspaces of R 3 and the clusters recovered by two 
algorithms, K-means and the proposed PSC algorithm. In these examples PSC recovers the true cluster 
assignments and estimates the subspaces correctly. 
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Scenarios 



Algorithm 



PSC 0.980 (0.095) 0.999 (0.003) 1.00 (0.000) 0.942 (0.042) 0.943 (0.089) 

GPCA 0.846 (0.226) 0.786 (0.199) 0.921 (0.176) 0.791 (0.088) 0.837 (0.068) 

A-means 0.503 (0.010) 0.525 (0.028) 0.595 (0.097) 0.635 (0.079) 0.702 (0.034) 

A-subspaces 0.702 (0.132) 0.804 (0.151) 0.829 (0.149) 0.766 (0.107) 0.877 (0.076) 

SCC 0.999 (0.004) 1.00 (0.000) 1.00 (0.000) 0.754 (0.079) 0.893 (0.073) 

MCtFA 0.812 (0.158) 0.984 (0.071) 0.883 (0.162) 0.942 (0.118) 0.974 (0.049) 

Table 1: Mean clustering accuracy for six competing clustering algorithms obtained by using five simulated 
(non-sparse) data sets over 100 Monte Carlo simulations. Standard errors are reported in parenthesis. 

Scenarios 



Algorithm 



PSC 0.960 (0.189) 0.999 (0.006) 1.00 (0.000) 0.877 (0.181) 0.943 (0.226) 

GPCA 0.693 (0.451) 0.572 (0.397) 0.843 (0.351) 0.545 (0.179) 0.576 (0.171) 

A-means 0.007 (0.021) 0.055 (0.055) 0.190 (0.194) 0.198 (0.176) 0.213 (0.093) 

A-subspaces 0.401 (0.263) 0.609 (0.302) 0.659 (0.297) 0.484 (0.234) 0.687 (0.186) 

SCC 0.998 (0.008) 1.00 (0.000) 1.00 (0.000) 0.454 (0.175) 0.719 (0.188) 

MCtFA 0.625 (0.315) 0.966 (0.142) 0.767 (0.323) 0.879 (0.246) 0.936 (0.119) 

Table 2: Mean AM for 6 competing clustering algorithms obtained by using five simulated (non-sparse) 
data sets over 100 Monte Carlo simulations. Standard errors are reported in parenthesis. 

For this test case, we compared the performance of PSC using three different levels of sparsity: 10 (correct 
sparsity), 100 and 200 (no sparsity). We compare to all of the previous methods but replace AT-means with 
Sparse Af-means (SKM), which is also able to perform sparse clustering [35], using the correct sparsity 
level. 

Table [3J shows the mean clustering accuracy over 100 Monte Carlo simulations. When the correct 
number of informative variables is used, PSC obtains the highest clustering accuracy out of all methods in 
all scenarios. This is not surprising though, given that the other algorithms to do have any built-in variable 
selection mechanisms. Amongst the other subspace clustering algorithms, SCC achieves the best results 
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Scenarios 
Algorithm a b c 



PSC - 10 vars 0.776 (0.067) 0.887 (0.058) 0.941(0.034) 0.781 (0.045) 0.812 (0.033) 
PSC - 100 vars 0.731 (0.111) 0.740 (0.121) 0.549(0.061) 0.781 (0.045) 0.812 (0.031) 
PSC - P vars 0.643 (0.149) 0.708 (0.150) 0.533(0.068) 0.680 (0.099) 0.712 (0.041) 
GPCA 0.500 (0.021) 0.500 (0.014) 0.524 (0.009) 0.634 (0.034) 0.711 (0.016) 

SKM - 10 vars 0.575 (0.081) 0.667 (0.137) 0.787 (0.147) 0.590 (0.027) 0.637 (0.027) 
A'-subspaces 0.499 (0.005) 0.499 (0.004) 0.498 (0.005) 0.560 (0.006) 0.632 (0.005) 

SCC 0.607 (0.078) 0.729 (0.091) 0.832 (0.067) 0.612 (0.033) 0.733 (0.034) 

MCtFA 0.596 (0.115) 0.650 (0.156) 0.673 (0.161) 0.574 (0.091) 0.663 (0.104) 

Tabic 3: Mean clustering accuracy for six competing clustering algorithms obtained by using five simulated 

(sparse) data sets over 100 Monte Carlo simulations. Standard errors are reported in parenthesis. 



Scenarios 
Algorithm a b c 



PSC - 10 vars 0.551 (0.135) 0.774 (0.116) 0.883 (0.069) 0.509 (0.098) 0.509 (0.085) 

PSC - 100 vars 0.394 (0.221) 0.416 (0.321) 0.099 (0.122) 0.389 (0.108) 0.277 (0.086) 

PSC - P vars 0.287 (0.277) 0.415 (0.318) 0.034 (0.099) 0.339 (0.174) 0.196 (0.073) 

GPCA 0.096 (0.033) 0.099 (0.034) 0.132 (0.0033) 0.174 (0.077) 0.232 (0.043) 

SKM - 10 vars 0.151 (0.163) 0.335 (0.275) 0.575 (0.292) 0.080 (0.057) 0.076 (0.052) 

A-subspaces 0.002 (0.009) 0.003 (0.009) 0.004 (0.009) 0.006 (0.014) 0.015 (0.013) 

SCC 0.213 (0.156) 0.457 (0.182) 0.663 (0.134) 0.134 (0.071) 0.290 (0.091) 

MCtFA 0.194 (0.229) 0.303 (0.311) 0.347 (0.320) 0.159 (0.135) 0.284 (0.159) 

Table 4: Mean ARI for 6 competing clustering algorithms obtained by using five simulated (sparse) data 
sets over 100 Monte Carlo simulations. Standard errors are reported in parenthesis. 

in all scenarios. The unpenalised version of PCS and SCC are directly comparable in this context, and 
they both achieve similar performances overall, which SCC performing very well for scenario (c). Clearly, 
when the incorrect number of variables are selected by PSC, its performance decreases. Importantly, PSC 
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Scenarios 

Algorithm a b c d e 

SOD - only K 0.86 0.66 0.90 0.45 0.35 

PRESS - only K 0.89 1.0 0.96 0.62 0.70 

PRESS - both K and R k 0.73 0.84 0.91 0.60 0.51 

Table 5: Performance of PSC used jointly with SOD and the PRESS criterion for learning the number 
of clusters, K. Reported here is the number of times the correct K was selected across 100 Monte Carlo 
simulations. The description of the five data sets is in the text. 

performs better than Sparse if-means which underlies the importance of performing variable selection in 
each subspace separately since different variables may be important in each clusters. As we also expect, 
the performance of PSC degrades as the dimensionality of the subspaces increases. This is due to the 
constraint that basis vectors of each subspace must be mutually orthonormal. Therefore, if the incorrect 
sparsity pattern is estimated in the first loading, all subsequent loadings are also likely to be estimated 
incorrectly. However, since PSC still performs better than all other algorithms in settings (d) and (e) when 
some sparsity is imposed, this further highlights the benefit of estimating the underlying subspaces using 
only the truly important variables. 

The mean adjusted rand index reported in Table H] further highlights that this particular setting is 
particularly challenging for all the algorithms. Here the degradation in performance of PSC when the 
wrong number of variables are selected appears more clearly. Without sparsity, PSC performs better 
than other methods in scenarios (a) and (d) and competitively with SCC in (b) and (e). Scenario (c) is 
particularly challenging and in this case both PSC without any sparsity and K-subspaces perform quite 
poorly. Here the discriminative subspaces consist of two planes which are corrupted by noise and there is a 
high degree of overlap between them; moreover, they can only be identified by those 10 relevant variables. 
However, imposing sparsity in the solution helps PSC identify the correct variables, which in turn produces 
the best ARI. This scenario highlights one of the main limitations of the unpenalised PSC algorithm which 
have been overcome by its penalised version. 

Finally, we tested the ability of the PSC algorithm to perform model selection as described in Section 
14.41 using the same sparse simulated data as described previously. Table [5] shows the number of times that 
the correct K was selected, in each one of the five scenarios, over 100 Monte Carlo simulations. Both the 
PRESS and SOD methods were used to learn the number of clusters. As can be seen in the Table, the 
PRESS achieves a good performance in the selection of K in the first three scenarios, and its performance 
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Dataset 


is: 


AT 


P 


1 


Leukemia [11] 


3 


38 


999 


2 


CNS tumors [20] 


5 


48 


1000 


3 


Normal tissues [22] 


13 


99 


1277 


4 


St. Jude leukemia [34] 


6 


248 


985 


5 


Lung cancer [3] 


4 


197 


1000 


6 


Novartis multi-tissue [26] 


4 


103 


1000 



Table 6: Six gene expression data sets. 

decreases when the data contain more clusters. SOD performs similarly in scenarios (a) and (c); however, 
for all other scenarios, it performs relatively poorly. The bottom row in Table [5] shows the performance 
obtained in learning K when the dimensionality Rk is also learned using the PRESS. Although earning 
both the dimension of all subspaces and the number of clusters is a more difficult problem, the PRESS 
still provides satisfactory results. Remarkably, in this more difficult setting, the PRESS is still competitive 
or even superior than SOD in many scenarios. Some degradation in performance can be explained by the 
presence of noise causing the algorithm to mistakenly identify 1-dimensional subspaces as 2-dimensional. 
This is particularly important in setting (a) where the SOD method performs well. 

6 Applications to genomic data sets 

In order to test the proposed method on real data sets, we applied it to clustering biological samples in 
six publicly available gene expression datasets. DNA microarrays measure levels of thousands of mRNAs. 
PCA is routinely used for the analysis and visualisation of these biological measurements because the 
expression levels of tens of hundreds of genes in samples drawn from the same underlying population are 
often observed to be highly correlated [23] . The individual datasets used for our experiments have been 
obtained in a preprocessed form from the MIT Broad instituteQ In each of the datasets, the different classes 
correspond to different tumour or tissue types relating to different cancers. Each data set is characterised 
by the number of clusters, ranging from 3 to 13, the sample size, N, and the number of genes, P. A 
summary of these features is reported in Table [5] 

We compared the clustering performance obtained by PSC against the other state-of-art subspace 
clustering algorithms that were tested on simulated data sets. Of all the algorithms considered, both 
Mixture of Common t- Factor Analysers (MCtFA) and Sparse X-means (SKM) have been previously applied 
to clustering gene expression data, and SKM in particular is able to perform gene selection. The PSC 
algorithm was run using up to 3 latent factors, and 5 different levels of sparsity, where the number of 



2 http : //www . broadinstitute . org/cgi-bin/cancer /datasets . cgi 
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selected genes is between 10 and P. Clearly, when P genes are selected, no sparse solutions are imposed. 
The SKM algorithm is run using the same levels of sparsity and, even in this case, selecting P variables 
is equivalent to standard A-means. For A"-subspaces and SKM we use 50 random restarts and take the 
cluster configuration which minimises the within cluster sum of squares. SCC is run 50 times, each time 
using up to R = 3 dimensions and take the mean result. MCtFA is run with 50 restarts, the results which 
minimise the negative log likelihood are taken as final. 

In Tables [7] and |8] we report on the clustering accuracy and Adjusted Rand Index (ARI), respectively. 
Highlighted in each column is the best performance attained for each method. The PSC algorithm without 
sparsity and with two latent factors achieves a good clustering accuracy on all but one of the datasets 
indicating that samples belonging to different clusters can be well approximated by a two-dimensional 
linear subspace; these results also confirm that PSC is able to estimate this subspace accurately in high- 
dimensions. It can be seen that in all cases, maximum clustering accuracy can be achieved when some 
degree of sparsity is imposed. When R = 1, the best solution is achieved when 50 or fewer variables 
have been selected. For larger values of R, the best solution occurs when 100 or fewer variables have 
been selected. For all values of R, the non-sparse PSC algorithm never outperforms the best sparse PSC, 
which indicates that a certain number of genes in these data sets have a small contribution towards the 
determination of the various clusters. 

It is important to observe that standard (non-penalised) A-means performs almost equivalently to 
PSC on datasets 1 and 3 indicating that these clusters are well separated geometrically. However, as the 
level of sparsity is increased, the accuracy of A-means typically decreases. This is because the sparsity is 
estimated using all samples and selected genes are necessarily present in all clusters. PSC achieves better 
performance because it selects the important genes within each cluster individually. Compared to A-means, 
SCC achieves a greater clustering accuracy on all but datasets 1 and 3. Increasing the dimensionality of the 
subspaces does not greatly affect the clustering accuracy. The large difference in clustering performance 
on dataset 1 between SCC and PSC seems to suggest that PSC may perform better in higher dimensions. 

When one a one-dimensional subspace is extracted, A-subspaces achieves a good clustering accuracy 
and outperforms standard PSC on datasets 1,3 and 4. However, on these datasets, A-subspaces is also 
extremely sensitive to the number of dimensions. In all cases, the clustering performance decreases mono- 
tonically when the number of dimensions is increased, which makes the problem of selecting the best model 
particularly difficult. GPCA displays similar performance to A-subspaces, whereas MCtFA typically per- 
forms worse than PSC for all values of R. However for R > 1, MCtFA performs equivalently to PSC on 
datasets 4, 5 and 6. 

The effect of changing the number of latent factors in PSC, SCC, MCtFA and GPCA can also be 
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compared. For SCC, the effect on the clustering accuracy of changing R is typically not large. However, for 
PSC and MCtFA, the effect of changing the subspace dimensionality generally varies between datasets and 
is non-monotonic. This non-monotonic behaviour was also observed by [1] for values of up to R — 10. As 
noted before, this makes model selection particularly challenging. It could be argued that, for some datasets, 
increasing the number of latent factors beyond three might be helpful for some algorithms; although this 
may be the case, increasing the number of latent factors in each clyster requires the estimation of a much 
larger set of parameters, and this seems unnecessary in light of the good performance of PSC, GPCA 
and -ftf-subspaces with only a few latent factors. As the number latent factors increases, we also observe 
that the performance of GPCA decreases. The corresponding results using ARI as a performance measure 
highlights a larger disparity between competing methods and PSC, in support for the latter. In particular, 
the results suggest that these other methods may miss smaller clusters. This is particularly evident in 
dataset 3 which is especially challenging due to the presence of 13 clusters of varying sizes. 

We also attempted to learn the number of clusters in the six gene expression datasets using the PRESS 
based method described in Section |4~41 by setting R = 1 and selecting 10 variables. The results are reported 
in Table [9) It can be seen that the PRESS is able to correctly identify the true number of clusters in three 
of the datasets. In datasets 4 and 5 the number of clusters is underestimated by one. However, the PRESS 
is unable to correctly identify the true number of clusters in dataset 3. This is perhaps not surprising 
because this data set contains clusters of normal tissues, which are quite similar to each other, whereas the 
other datasets contain sub-types of cancerous tissues which show a much higher degree of separation. 

7 Conclusion 

In this work we have introduced an efficient approach to subspace clustering for high-dimensional data. 
Our algorithm relies on a new measure of influence for PCA, derived from an approximated PCA PRESS 
statistic, which can also be used in other applications unrelated to clustering to detect influential obser- 
vations. Compared to our initial work in |16) . here we have presented the relevant methodology in much 
greater detail, along with a number of extensions and additional empirical evaluations, including a proof 
of convergence of the PSC algorithm. 

A penalised version of the algorithm has also been introduced that can perform simultaneous subspace 
clustering and variable selection. For high-dimensional data, penalising the PCA solution in this way aids 
in the interpretability of the resulting partitions by identifying which variables contribute to each latent 
factor, and therefore which variables are important for explaining the directions of maximal variability 
in the data. Although the problem of variable selection in clustering has been discussed before (see, for 
instance, [32,), we are not aware of other subspace clustering algorithms which estimate sparse subspace 
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Data Set 






Algorithm 




1 


2 


3 


4 


5 


6 




10 


0.959 


0.932 


0.910 


0.969 


0.922 


0.954 




50 


0.963 


0.929 


0.938 


0.950 


0.888 


0.944 


PSC (1) 


100 


0.936 


0.887 


0.926 


0.946 


0.855 


0.903 




500 


0.963 


0.866 


0.908 


0.873 


0.707 


0.849 




P 


0.862 


0.830 


0.915 


0.875 


0.695 


0.833 




10 


0.911 


0.905 


0.923 


0.963 


0.815 


0.989 




50 


1.000 


0.914 


0.929 


0.961 


0.839 


0.980 


PSC (2) 


100 


1.000 


0.892 


0.923 


0.968 


0.866 


0.980 




500 


1.000 


0.858 


0.906 


0.949 


0.715 


0.980 




P 


1.000 


0.893 


0.918 


0.961 


0.621 


0.980 




10 


0.959 


0.894 


0.871 


0.962 


0.858 


0.990 




50 


1.000 


0.861 


0.911 


0.969 


0.845 


0.990 


PSC (3) 


100 


0.959 


0.877 


0.927 


0.980 


0.865 


0.990 




500 


0.963 


0.850 


0.879 


0.972 


0.816 


0.990 




P 


0.923 


0.854 


0.876 


0.953 


0.740 


0.990 




10 


0.812 


0.695 


0.871 


0.774 


0.547 


0.755 




50 


0.852 


0.720 


0.898 


0.777 


0.580 


0.765 


Sparse K-means 


100 


0.885 


0.734 


0.913 


0.776 


0.580 


0.765 




500 


0.959 


0.734 


0.916 


0.776 


0.582 


0.765 




P 


0.959 


0.734 


0.913 


0.776 


0.582 


0.765 


Tf-subspaces (1) 




0.918 


0.784 


0.932 


0.935 


0.660 


0.819 


if-subspaces (2) 




0.684 


0.761 


0.921 


0.814 


0.593 


0.747 


Tf-subspaces (3) 




0.558 


0.718 


0.877 


0.771 


0.563 


0.715 


SCC (1) 




0.675 


0.770 


0.865 


0.869 


0.517 


0.897 


SCC (2) 




0.671 


0.772 


0.864 


0.868 


0.594 


0.898 


SCC (3) 




0.672 


0.774 


0.865 


0.869 


0.598 


0.901 


MCtFA (1) 




0.652 


0.717 


0.630 


0.820 


0.645 


0.696 


MCtFA (2) 




0.693 


0.522 


0.642 


0.932 


0.732 


0.943 


MCtFA (3) 




0.879 


0.689 


0.725 


0.862 


0.922 


0.963 


GPCA (1) 




0.852 


0.836 


0.912 


0.905 


0.580 


0.803 


GPCA (2) 




0.822 


0.799 


0.903 


0.759 


0.562 


0.708 


GPCA (3) 




0.590 


0.706 


0.866 


0.755 


0.558 


0.746 



Table 7: Mean clustering accuracy of competing clustering algorithms on the six gene expression datasets 
described in table [5] 
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Data Set 






Algorithm 




1 


2 


3 


4 


5 


6 




10 


0.911 


0.778 


0.400 


0.909 


0.844 


0.874 




50 


0.919 


0.772 


0.586 


0.850 


0.776 


0.849 


PSC (1) 


100 


0.860 


0.655 


0.507 


0.841 


0.712 


0.739 




500 


0.919 


0.578 


0.407 


0.619 


0.424 


0.612 




P 


0.698 


0.460 


0.434 


0.636 


0.401 


0.547 




10 


0.911 


0.697 


0.495 


0.889 


0.634 


0.973 




50 


1.000 


0.732 


0.542 


0.881 


0.681 


0.946 


PSC (2) 


100 


1.000 


0.667 


0.466 


0.906 


0.735 


0.946 




500 


1.000 


0.577 


0.440 


0.849 


0.441 


0.946 




P 


1.000 


0.673 


0.493 


0.885 


0.639 


0.946 




10 


0.911 


0.675 


0.038 


0.888 


0.719 


0.973 




50 


1.000 


0.570 


0.370 


0.907 


0.693 


0.973 


PSC (3) 


100 


0.959 


0.613 


0.496 


0.942 


0.731 


0.973 




500 


0.963 


0.533 


0.293 


0.917 


0.637 


0.973 




P 


0.923 


0.548 


0.131 


0.865 


0.488 


0.973 




10 


0.605 


0.128 


0.236 


0.342 


0.115 


0.426 




50 


0.685 


0.203 


0.379 


0.345 


0.176 


0.421 


Sparse K-means 


100 


0.749 


0.212 


0.435 


0.355 


0.176 


0.421 




500 


0.911 


0.212 


0.457 


0.348 


0.181 


0.421 




P 


0.911 


0.212 


0.455 


0.348 


0.181 


0.421 


Tf-subspaces (1) 




0.824 


0.424 


0.682 


0.803 


0.333 


0.544 


if-subspaces (2) 




0.320 


0.269 


0.490 


0.431 


0.205 


0.356 


Tf-subspaces (3) 




0.030 


0.141 


0.289 


0.295 


0.140 


0.248 


SCC (1) 




0.291 


0.302 


0.173 


0.579 


0.215 


0.717 


SCC (2) 




0.289 


0.299 


0.183 


0.581 


0.210 


0.730 


SCC (3) 




0.310 


0.316 


0.193 


0.571 


0.203 


0.705 


MCtFA (1) 




0.367 


0.226 


0.034 


0.708 


0.149 


0.429 


MCtFA (2) 




0.830 


0.317 


0.055 


0.809 


0.598 


0.921 


MCtFA (3) 




0.748 


0.213 


0.085 


0.759 


0.844 


0.899 


GPCA (1) 




0.692 


0.540 


0.451 


0.725 


0.181 


0.514 


GPCA (2) 




0.607 


0.373 


0.339 


0.424 


0.143 


0.261 


GPCA (3) 




0.112 


0.087 


0.09 


0.318 


0.133 


0.372 



Table 8: Mean ARI of competing clustering algorithms on the six gene expression datasets described in 
table |U 
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Dataset 


Real K 


Estimated if 


1 


3 


3 


2 


5 


5 


3 


13 


6 


4 


6 


5 


5 


4 


5 


6 


4 


4 



Table 9: Real vs learned number of data clusters obtained by PCS with the PRESS using only the first 
principal component on the six gene expression data sets. 



parameters. Furthermore, more structured penalties could be used instead of the simple l\ penalty [10) . 
For instance, non- negative cluster- wise parameters may be more appropriate in some situations [31] . 

Extensive simulation experiments have been presented here that compare a number of subspace clus- 
tering algorithms. For these experiments, we have selected challenging scenarios in which the data within 
each cluster have low-dimensional representations that need to be identified, including the case where these 
representations are sparse. Our results indicate the SPC is particularly competitive in a wide range of sit- 
uations. Apart from comparisons on artificially constructed data, we have also tested whether a subspace 
approach is beneficial in clustering biological samples for which a thousand gene expression levels have 
been measured. Our empirical evaluation using six different publicly available data sets suggest that, al- 
though the clusters in some data sets might be discovered using more traditional algorithms that exploits 
geometric structures, subspace clustering is very useful in many other datasets due to the fact that gene 
expressions levels within each cluster can be approximated well by a few principal components. In our 
experiments, PSC always appears to be very competitive and, in several situations, has also been shown 
to out-performs other competing methods. Apart from gene expression data, the PSC algorithm had been 
previously shown to perform particularly well in clustering digital images of human faces collected under 
different lighting conditions for which a cluster- wise PC A reconstruction is also appropriate |16j . 

The PSC algorithm maintains some similarity to the isT-subspaces algorithm. As with if-subspaces, we 
itcratively fit cluster-wise PCA models and reassign points to clusters until a certain optimality condition is 
met. However, rather than trying to minimise the residuals under the individual PCA models, we introduce 
an objective function that exploits the predictive nature of PCA in a way that makes it particularly robust 
against overfitting. Along with the PRESS statistic, the PSC is able to learn both the number of clusters and 
the dimensionality of each subspace, although this is a particularly difficult problem and more investigation 
is required. The difficult issue of selecting the correct level of sparsity within each subspace will also be 
explored further in further work. 
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A Derivation of predictive influence 

Using the chain rule, the gradient of the PRESS for a single latent factor is 

a/« i d (1)||2 1 (1) d (1) 

= We • = — e . e . 

dxi 2dxi" " l " 2 ' l dxi ~ 1 ' 

For notational convenience we drop the superscript in the following. Using the quotient rule, the partial 
derivative of the i th leave-one-out error has the following form 

1 dxi 



d 7^(1 -/>i)+e,f|i 



dx- e -' (i - h t y 

which depends on the partial derivatives of the i th reconstruction error and the hi quantities with respect 
to the observation a^. The computation of these two partial derivatives are straightforward and are, 
respectively 

- — ei = -= — Xi Up - vv T ) = (ip - vv T ) , 

OXi OXi 

and 

d d 

— — hi = — — XivDv T xJ = 2vDdi. 

OXi OXi 

The derivative of the PRESS, J with respect to Xi is then 

1 d .. ||2 d (l P -vv T ){l-hi)+2eivDdi 

(28) -— He-ill 2 = e_ i —e_ i = e_/ 



2 dxi dxi (1 — hi) 2 

However, examining the second term in the sum, eivDdi, we notice 

eivDdi — (xi — XiW T )vDdi = XivDdi — XiW T vDdi = 0. 



Substituting this result back in Eq. (|28|) . the gradient of the PRESS for a single PC A component with 

respect to Xi is given by 

10 a (I P - vv T ) (1 - hi) (l P -vv T ) 

1 e_i =e- t - f-^ = e_i 



2dxi" ,n (l-^) 2 (1-hi) ■ 

In the general case for R > 1, the final expression for the predictive influence Tc(xi) € M Pxl of a point 
a;, under a PCA model then has the following form: 

/ R ( j — v ( r ) v ( r ) 



,r=l 



1 - /i 



B Proof of Lemma [T] 

From Appendix El for R=l, the predictive influence of a point 7r(a:i; i>) is 
(29) <Xi . v) = _^_ 
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This is simply the i th leave-one-out error scaled by 1 — hi. If we define a diagonal matrix S € R NxN with 
diagonal entries E* = (1 — hi) 2 , we can define a matrix II <E E ArxP whose rows are the predictive influences, 
II = [tv(xi;v) t , . . . ,7r(a;jv; f) T ] T - This matrix has the form 

U=S- 1 (X-Xvv T ). 

Now, solving (|21[) is equivalent to minimising the squared Frobenius norm, 



(30) min 1rl(X- Xvv T ) ET 2 (X - Xvv 1 

subject to ||i>|| = 1. 

Expanding the terms within the trace we obtain 



.vv Ts 



Tr ((X - Xvv T ) T S- 2 (X - Xi 
= Tr (X T 3~ 2 X) - 2Tr (vv T X T S- 2 X) + Tr (vv T X T S- 2 Xvv T ) . 

By the properties of the trace, the following equalities hold 

Tr(vv T X T E- 2 X) =v T X T *- 2 Xv, 
and 

Tr (vv T X T 3- 2 Xvv T ) = Tr (3- 1 Xvv T vv T X T 3- 1 ) 

= v T X T a-' 2 Xv, 

since S is diagonal and v T v = 1. Therefore, ([5U]) is equivalent to 

(31) minTrX T H- 2 X - v T X T E,~ 2 Xv, 

subject to ||t>|| = 1. 

It can be seen that under this constraint, Eq. (|3ip is minimised when v T X T a~ 2 Xv is maximised which, 
for a fixed S is achieved when v is the eigenvector corresponding to the largest eigenvalue of X T a~ 2 X . 

C Proof of Lemma [2] 

In this section we provide a proof of Lemma [2] As an additional consequence of this proof, we develop an 
upper bound for the approximation error which can be shown to depend on the leverage terms. We derive 
this result for a single cluster, C^ however it holds for all clusters. 
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We represent the assignment of points i = 1, . . . , N to a cluster, C^ T > using a binary valued diagonal 
matrix A whose diagonal entries are given by 

( 1, iiieC^ 

(32) At = \ 

[ 0, otherwise, 

where Tr(A) = JV&. We have shown in Lemma [1] that for a given cluster assignment, the parameters which 
optimise the objective function can be estimated by computing the SVD of the matrix 

(33) J2 x i S 7 2 *i = X T S- 2 AX, 

within each cluster where the i th diagonal element of H is E< = (1 — hi) 2 < 1, so that Er 2 > 1. We can 
then represent H~ 2 = Jjy + $ where $ € IR" X " is a diagonal matrix with entries $^ = fa > 0. Now, we 
can represent Eq. (|33[) at the next iteration as 

(34) M =X T A{I N + *)X. 

We can quantify the difference between the optimal parameter, v* obtained by solving (122(1 using Af 
and the new PCA parameter estimated at iteration r + 1, v^ as, 

E(S*,S {T) ) = v* T M^v* - v^ T X T AXv^ T \ 

where v^ is obtained through the SVD of X T AX. We can express E(S*,S^) in terms of the spectral 
norm of M. Since the spectral norm of a matrix is equivalent to its largest singular value, we have 
V ( T ) X T AXv 1 - ^ = \\X T AX\\ Since $ is a diagonal matrix, its spectral norm, ||$|| = max($). Similarly, 
A is a diagonal matrix with binary valued entries, so \\A\\ = 1. 

E(S*,S (T) ) < \\M - X T AX\\ 
< \\X T A<f>X\\ 

(35) <max($)||X T X||. 

Where the triangle and Cauchy-Schwarz inequalities have been used. In a similar way, we now quantify 
the difference between the optimal parameter and the old PCA parameter iA T ~ 1 ), 

E(S*,S {T ~ 1) ) = v* T Mv* - v ( - T - 1 ^X T AXv^- 1 '>. 

Since v^ is the principal eigenvector of X T AX, by definition, v^ X T AXv^ is maximised, therefore 
we can represent the difference between the new parameters and the old parameters as 

E(S ( r\S ( r- 1 '>) = v (t ^ T X T AXv^ - v^' 1 ^ X T AXv^' 1 ^ > 0. 
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Using this quantity, we can express E(S*,S^ T ^) as 

E(S*,S<> T -V) <\\M\\ - v^- 1 ^ X T AXv {T -^ 

<\\X T ®AX\\ + ||X T AX|| - v {T -^ T X T AXv^- 1 ^ 

(36) <max(*)i|X T X|| +S(5 (r) ,5 (T - 1) ), 

From Eq. ([55)) and ([33]) it is clear that 

(37) E{S*, S {T) ) < E{S*,S( T -V). 

This proves Lemma [2j 

The inequality in Eq. (|3"?| implies that estimating the SVD using X T AX obtains PCA parameters 
which are closer to the optimal values than those obtained at the previous iteration. Therefore, estimating a 
new PCA model after each cluster re-assignment step never increases the objective function. Furthermore, 
as the recovered clustering becomes more accurate, by definition there are fewer influential observations 
within each cluster. This implies that max($) — > 0, and so E(S* ,S^) — > 0. 
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